Worksheet 2b

Importing, exploring and describing data


Timetable week: 5
Topic: "Measurement and Description"

Learning outcomes

By the end of the session you will know how to:

  • Install and load packages in R
  • Import data into your R environment
  • Explore variables of different types using descriptive statistics
  • Create basic descriptive graphs to visualise your data

Introduction

In Worksheet 2a you had a look at the RStudio environment and created an R script (script.R or Lab_2.R) and a Quarto markdown document (Lab_2.qmd). In Worksheet 1 you also explored some survey data sources and read through survey documentation to gain an understanding of how sociological concepts - such as “trust” - are measured and operationalised in empirical research. The exercises on this worksheet begin to bring these two activities together by exploring the original survey data using R.

Open the Lab_2.R script file and start working there. At the end, once you have completed all the exercises, the final task will be to transfer some of the code from the R script into the Lab_2.qmd markdown document, add some more description to what the code is supposed to achieve, and test if you have done it all correctly by rendering it to HTML or Microsoft Word.

Exercise B1: R functions and user-written packages

About 15 minutes


Most work in R is done using functions. The most common operations involving a function take the following generic form (think of an analogy of baking a loaf of bread):
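A minimal sketch of that generic form, using a real base-R function (the `bake()` analogy line is purely illustrative, not a real function):

```r
# Generic pattern:  output <- function_name(input, option = value)
# Baking analogy:   loaf   <- bake(dough, temperature = 220)

# A real base-R example: round an input number to 2 decimal places
result <- round(3.14159, digits = 2)
result
#> [1] 3.14
```

The function takes some input (the "dough"), possibly some options (the "oven settings"), and returns an output that we usually store in a named object.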

It’s possible to create your own functions. This makes R extremely powerful and extensible. But instead of programming our own functions, we can rely on functions written by other people and bundled together into packages designed to perform specific (or sometimes many very general) tasks.
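As a quick illustration (a made-up function, not one we will need later), defining your own function looks like this:

```r
# A tiny user-defined function: convert a proportion to a percentage
to_percent <- function(x, digits = 1) {
  round(x * 100, digits)
}

to_percent(0.756)
#> [1] 75.6
```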

There are a large number of reliable, tested and oft-used packages containing functions that are particularly useful for social scientists. In this module, we will rely on several such user-written packages that extend the basic packages already bundled in with our R software (the so-called base-R packages and functions).

Most mature packages are available from the Comprehensive R Archive Network (CRAN) or from other repositories such as Bioconductor and GitHub. Packages made available on CRAN can be installed using the command install.packages("packagename"). Once the package/library is installed (i.e. it is sitting somewhere on your computer), we then need to load it into the current R session using the command library(packagename).

Install the tidyverse collection of packages and read about the packages and functions they contain

Type the following commands into your R script and execute each line of code:
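The commands are not reproduced here; for installing and loading the tidyverse they would be along these lines:

```r
# Download and install the package collection from CRAN
# (needs to be done only once per computer):
install.packages("tidyverse")

# Load the packages into the current session
# (needs to be done in every new R session):
library(tidyverse)
```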

Here’s a one-minute video crash-course on what you are expected to do:

We can check the list of packages that are loaded with the tidyverse library using a command from the tidyverse itself:

With the library() function, we are loading the entire “library” of functions and data from the given package (or suite of packages). If we know that we will only be using one or two functions just once or twice from a package in our session, we could alternatively just use the function we need without loading the entire library, prepending the function name with the package name and two consecutive colons (packagename::functionname()):

tidyverse::tidyverse_packages()

I will sometimes use this form in the worksheets to clarify what package a function originates from, even if the package is loaded in the library.

Now that you’ve installed a package, write the required functions in your R scripts to install and load the following packages that we will also be using a lot:
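The package list is not reproduced here; judging from what this worksheet uses later, it likely includes at least easystats and ggformula, which would be installed and loaded like this:

```r
# Assumed package list (based on the later exercises), install once:
install.packages("easystats")
install.packages("ggformula")

# Load in each session:
library(easystats)
library(ggformula)
```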

You can read more about the packages we have just installed here:

Exercise B2: Importing data

About 15 minutes


So far we have learnt about some useful functions for installing and loading R packages. We can now look at functions that can be used to load a dataset and make it available for analysis. R’s native data format carries the extension .rds. But we can import data stored in other formats too, such as the generic comma separated values (.csv) format or the various formats used by other proprietary statistical analysis packages (e.g. SPSS, Stata, SAS).
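As a self-contained sketch of the simplest case, we can write a tiny .csv file to a temporary location and read it back with base R (the variable names below are invented for illustration):

```r
# Create a small data frame and save it as a .csv file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(4, 5, 3)), tmp, row.names = FALSE)

# Read it back in: the same pattern applies to real datasets
dat <- read.csv(tmp)
dat
#>   id score
#> 1  1     4
#> 2  2     5
#> 3  3     3
```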

The survey data that we most commonly use in the social sciences is usually distributed in one of the proprietary formats mentioned above, because those have been designed specifically to manage “labelled” data (i.e. variables that need longer descriptive labels, and many categorical variables whose categories/levels only make sense if they are themselves meaningfully “labelled”).

We saw from the exercises in Worksheet 1 that the World Values Survey (WVS) provides the greatest number of options for data download formats, including .rds. All other surveys provide their data either in .csv, .sav (SPSS) or .dta (Stata) format. My advice is to download the .sav (SPSS) version of the data whenever possible, simply because SPSS allows the longest variable and value labels, so its data labelling is likely to be the most complete.

Traditionally, one of the most severe shortcomings of R compared to proprietary statistical analysis packages has been its very limited and cumbersome support for labelled data. This has changed significantly over the past few years, and currently a suite of packages developed or contributed to by sociologist Daniel Lüdecke of the University of Hamburg are among the best available tools for this purpose - including the packages making up the easystats ecosystem, which we have just installed in the previous exercise.

Access to survey data

In these lab sessions we will be using real-world data from some of the surveys that you learnt about in Lab 1. However, real-life data - even structured survey data - can be messy and very large.

To make it easier for you to focus on the more substantive aspects of methodological work in this module, you will be able to access lightly pre-formatted and reduced versions of those datasets. You can download the SOC2069 version of the data from Canvas for the purpose of working with it in this module.

However, if you want to use survey data in the future, you will need to download the original raw versions of the datasets from their original sources, usually after a free registration process.

You can read about the data that you have available here: https://soc2069.chrismoreh.com/data/data_main

Let’s load the World Values Survey, Wave 7 dataset.

First, download the dataset from Canvas and save it in the “Data” subfolder of the R Project folder you created in Worksheet 2a.

Once the dataset is downloaded, you can import it into R. Because the downloaded data file is in the .rds format, you could navigate to the file either from within RStudio’s Files pane or manually in Windows Explorer, and double-click the file to open it. However, this is not the recommended way of opening a file, because you want to record the action in your R script so you have a record of where the data came from.

It’s easy to import the file with a function command, but you will need to identify the path to the file first. If the data file is in the same folder as your R script/working directory, you only need to specify the data file’s name to import it. However, if the data is in a different directory, you will need to specify either a relative path from the script to the data file, or an absolute path from the computer’s home directory.
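For example (the file locations in the comments below are hypothetical), you can check your current working directory and then form a relative or absolute path from there:

```r
# Where is R currently looking for files?
getwd()

# Relative path, resolved from the working directory (hypothetical):
# wvs7data <- readRDS("Data/wvs7.rds")

# Absolute path, spelled out from the root of the drive (hypothetical):
# wvs7data <- readRDS("C:/Users/me/SOC2069/Data/wvs7.rds")
```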

You can load the dataset either with the readRDS() command - which is R’s native command for loading .rds files - or with a more generic function that you can also use later to load data in various other formats, such as the data_read() command from easystats’s datawizard package. On my computer, the command and path to the file would be the following:

wvs7data <- datawizard::data_read("D:/GitHub/SOC2069/Data/for_analysis/wvs7.sav")
Reading data...
Variables where all values have associated labels are now converted into
  factors. If this is not intended, use `convert_factors = FALSE`.
285 out of 289 variables were fully labelled and converted into factors.

In the code above, I am assigning (with the assignment operator <-) the data to an object I called “wvs7data”, but I could have given it any other name as long as it follows R’s naming conventions (no spaces, avoid special characters; see also the recommended naming strategies in the tidyverse style guide). The object will appear in the Environment tab (on the bottom right pane).

There are some complications if you copy/paste your folder path on a Windows PC. My path above would have copied as D:\GitHub\SOC2069\Data\for_analysis, but R does not recognise single backslashes as path elements because in R the backslash has a special meaning. You must either replace the backslashes with forward slashes or use double backslashes (D:\\GitHub\\SOC2069\\Data\\for_analysis).
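A doubled backslash in R code stands for a single backslash character in the stored string, which a quick check confirms:

```r
p <- "D:\\GitHub\\SOC2069"

# cat() shows the string as stored: D:\GitHub\SOC2069
cat(p, "\n")

# Each "\\" counts as one character
nchar(p)
#> [1] 17
```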

Another option is to use the function readClipboard() to save the path to an object:

mypath <- readClipboard()

The path is saved with double backslashes, so you can then just paste the file name onto it in the function that loads the data. The paste0() function concatenates the items you give it without inserting spaces between them:

wvs7data <- data_read(paste0(mypath, "/wvs7.sav"))
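To see what paste0() produces, you can run it on its own with any strings (the path below is just the earlier example):

```r
# Concatenate a folder path and a file name with no separator added
paste0("D:/GitHub/SOC2069/Data/for_analysis", "/wvs7.sav")
#> [1] "D:/GitHub/SOC2069/Data/for_analysis/wvs7.sav"
```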

Working with paths, folders and directories in a sustainable and robust way can be a challenge. The here package provides some useful options if you are working within R project directories. You can read more about the package in the “R for Social Scientists” course.

We can look at the dataset object in the Environment pane. If we click on the blue button with the white arrow before the name of the object, a list of variables and other information about them will roll down. If we click on the object’s name or info, the dataset will open in the Source pane, just next to the R script file. This is equivalent to having run the following command:

View(wvs7data)    

Note the capital “V”; R is case-sensitive, so always pay attention; view(wvs7data) won’t work.

You can explore the dataset a bit. Only the first 50 columns (i.e. variables) are displayed; to see the next 50, click on the arrow (>) in the dataset window’s toolbar. Once you’ve had a quick look, you can close that view and return to the R script.

We have already read about the WVS7 survey in Lab 1 and have had a look at the survey website. The codebook for the dataset is available at: https://soc2069.chrismoreh.com/data/data_main#world-values-survey-wave-7-wvs7

Have a look through the dataset description and variable list to get a good sense of the data. As you did last week, locate the variables that refer to “trust” and make a note of the variable names.

Exercise B3: Basic descriptive statistics

About 30 minutes


Let’s learn a few functions that can help us explore variables through basic descriptive statistics.

Tabulate categorical variables

Most of the variables in the dataset are categorical, so tabulating the frequency distributions of their categories can come in handy. There are various ways of doing this in R, but one of the most convenient options is to use the data_tabulate() function from the datawizard package.

Let’s look, for instance, at the “generalised social trust” variable, Q57:

wvs7data |> data_tabulate(Q57)

Let’s decipher the code above:

  • first, we assume that there is an object called ‘wvs7data’ in our Environment that contains the wvs7 dataset; unless we restarted RStudio or deleted it, the object should still be there from the previous exercise;

  • the |> operator (called a pipe) allows objects and functions to be fed (i.e. piped) forward to other functions. Getting into the habit of using the “pipe” workflow is particularly useful as it makes combining a series of operations/commands easier to read and follow. Here, we take our dataset and pipe it forward so that we can easily refer to variables from that dataset (such as Q57);

  • the remaining code applies the data_tabulate() function to the variable Q57.

We could have written the command in these ways, with the same result:

data_tabulate(wvs7data, Q57)

data_tabulate(wvs7data$Q57)

What we learn here is that another way of referring to variables within datasets is by using the $ operator. Here, wvs7data$Q57 extracts the Q57 variable/column from the wvs7data object.
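The $ extraction works on any data frame; a toy example (invented data, not the WVS) makes this clear:

```r
# A small made-up data frame
toy <- data.frame(trust = c("high", "low", "low"),
                  age   = c(25, 40, 31))

# Extract the age column as a plain vector
toy$age
#> [1] 25 40 31

# The extracted vector can be passed straight to other functions
mean(toy$age)
#> [1] 32
```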

We will aim to follow the “piped” workflow whenever possible, because it’s more flexible to expand and reads more logically to humans.

The resulting frequency table will be printed in the Console, and will look something like this:

Most people can be trusted (Q57) <categorical>
# total N=94278 valid N=93005

Value                      |     N | Raw % | Valid % | Cumulative %
---------------------------+-------+-------+---------+-------------
Most people can be trusted | 22552 | 23.92 |   24.25 |        24.25
Need to be very careful    | 70453 | 74.73 |   75.75 |       100.00
<NA>                       |  1273 |  1.35 |    <NA> |         <NA>

Question

Interpret the table:

  • How many of the respondents in the data think that “most people can be trusted”?
  • What is the percentage of those who are less trusting of others?
  • Are there any missing values (NA) on this variable?

Summarize numeric variables

There are far fewer numeric variables in this dataset (and in sociological datasets in general). One that we do have is age (Q262). The base-R function summary() is enough to get a basic summary of a quantitative/numeric variable. Base-R functions, however, don’t always work with a pipe ( |> ) workflow; summary() doesn’t, so we must use the $ operator:

summary(wvs7data$Q262)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  16.00   30.00   41.00   43.41   56.00  103.00     510 

Question

Interpret the table:

  • What is the average (mean) age of the respondents in the dataset?
  • What about the median age?
  • How spread out is the age of the respondents?
  • What is the minimum and maximum age of the respondents?
  • Are there any missing values (NAs) on this variable?

Visualising distributions

It is often useful to make a graph to visualise a variable, especially a numeric variable. A figure can often convey information in a much more efficient way than a statistic/number. One of the most useful graph types for visualising numeric variables is a histogram. Again, there are many functions available in R for producing graphs. Once you become more proficient and use R more often, it is very useful to learn the graphing workflow of the ggplot2 package (included in the tidyverse), which builds up a graph canvas step by step using various elements. That allows the creation of highly customised figures of publishable quality.

For quick graphs, base R also has good plotting functionality with simple commands. A histogram of age can be produced with:

hist(wvs7data$Q262)

The ggformula package can be a useful choice if you want to combine the ease of expanding the graphs in more complex ways with a command structure that will become more familiar once we start modelling relationships between variables.

All the plot calls in the ggformula package start with gf_, and they require a “formula” style input. The command would be:

gf_histogram( ~ Q262, data = wvs7data)
Warning: Removed 510 rows containing non-finite values (`stat_bin()`).

# or #

wvs7data |> gf_histogram( ~ Q262)
Warning: Removed 510 rows containing non-finite values (`stat_bin()`).

This function has a similar structure to the functions that we will use for statistical modelling, generically of the form goal(y ~ x, data = my_data). In our case, we are only “modelling” one variable here, so the first (y-axis) object before the tilde (~) is missing. The data can be specified with the option data = ..., or in a “piped” workflow.

Density plots are also useful for this purpose:

wvs7data |> gf_density( ~ Q262)
Warning: Removed 510 rows containing non-finite values (`stat_density()`).

Categorical variables are best visualised using bar charts. For example, a bar chart of the “generalised trust” variable:

wvs7data |> gf_bar( ~ Q57)

We can also add a categorical variable to break down a histogram or density plot by groups. For instance, we can use the “generalised trust” variable to split the age distribution into groups. To do this, we add the categorical variable after a vertical bar (|):

wvs7data |> gf_density( ~ Q262 | Q57)
Warning: Removed 510 rows containing non-finite values (`stat_density()`).

Exercise B4: On your own

Make sure that you have copied all the commands from this page to your R script and that you have saved the script so you can access the commands at a later date.

In your own time, go on and select a few more variables from the dataset and explore them with the appropriate descriptive statistics, using the functions we have practised above.

Think about what the results tell you. Finally, transfer the commands that you practised here into the Lab_2.qmd document, setting up code chunks appropriately, organising the code blocks into relevant sections under sub-titles, and adding some explanatory text in between. When you’re done, click on the “Render” button and check the resulting output. Will you manage to generate an output without errors?
